class: center, middle, inverse, title-slide

# Lecture 4
## Statistical Models and Notation
### Psych 10 C
### University of California, Irvine
### 04/06/2022

---

## Objective in research

- One of our main goals in research is to contrast our beliefs about the world against the outcomes of experiments.

--

- We start with a "verbal" statement that captures our beliefs about the world, which we then formalize into a statistical model.

--

- Statistical models allow us to make predictions about future observations. In the case of an experiment, they allow us to make predictions about the outcomes.

--

- We evaluate these predictions by comparing them with the outcomes (data) of the experiment.

--

- Finally, we interpret the results of our evaluation with respect to our original beliefs or statements about the world.

---

## Statistical models

- Statistical models are abstract representations of the world.

--

- They are formalizations of our beliefs about probabilistic events.

--

- For example, if we have an experiment where we flip a coin, we have two competing ideas about the coin:

--

    - The coin is **fair**.

--

    - The coin is **not fair**.

--

- We can formalize these two beliefs into a statistical model.

--

    - The coin is fair: `\(P(\{heads\})\ =\ P(\{tails\})\ =\ 0.5\)`

--

    - The coin is not fair: `\(P(\{heads\})\ \neq\ P(\{tails\})\)`

--

- We moved from verbal statements of our beliefs regarding the coin to formal statements about the probability of observing "heads" or "tails".

---

## Statistical Models

- Statistical models are the formal representation of our beliefs or hypotheses about the outcomes of an experiment.

--

- Because we assume that the outcomes are probabilistic, our models will have a probabilistic component associated with them.

--

- Given the nature of our observations, it will be almost impossible for us to tell if a model is TRUE or FALSE. However, we can compare how useful different models are in a given situation.
--

- Statistical models allow us to make predictions about our observations, which we can use to determine how useful these models are.

--

- Before we continue, it will be useful to introduce some notation!

--

- Notation provides us with a way to express our models in a formal and standard way.

---
class: inverse, middle, center

# Notation

---

## Example:

- To introduce notation, we will start with a problem.

--

- **Problem:** We want to know if people who smoke have a different lung capacity than people who don't smoke.

--

- We have a variable that we are interested in: lung capacity. Let's assume it's measured with some standard test.

--

- We also have a variable that indicates whether a given participant smokes or not.

--

- We call the first one a **dependent** variable, because we want to see how it "depends" on the values of other variables.

--

- We call the smoker indicator variable an **independent** variable. We are interested in how our independent variable affects the values of our dependent variable.

--

- In other words, we want to determine if lung capacity is a function of smoking status.

---

## Example: Smoking

- We collect data from 8 participants, 4 smokers and 4 non-smokers.
--

- We will denote specific values of our dependent variable using `\(y_{ij}\)`. For example, the first observation of our first group (non-smokers) is denoted as `\(y_{11}\)`, while the fourth observation of the same group is denoted as `\(y_{41}\)`.

---

## Example: Smoking

- In general, we say that the *i-th* observation of the *j-th* group is denoted as `\(y_{ij}\)`. Note that the letters `\(i\)` and `\(j\)` are a way to denote a general observation. If we want to look at a particular one, we can write:

--

    - `\(y_{21}=\)` 79
    - `\(y_{32}=\)` 70.4

--

- Now, we have a notation for our observations!

--

- Remember that our objective is to formalize our beliefs or hypotheses about the world.

--

- We know that our observations are probabilistic, so we need a way to describe their variability.

--

- In order to do this we will use the normal distribution.

---
class: inverse, middle, center

# The Normal (Gaussian) Distribution

---

## The Normal distribution

- The normal distribution is one of the most widely used statistical models in the literature.

--

- One of its main advantages is that it can be described using only two parameters, `\(\mu\)` and `\(\sigma^2\)`.

--

- We denote the Normal distribution as `\(\text{Normal}(\mu,\sigma^2)\)`.

---

## Standard Normal distribution

- `\(\text{Normal}(\mu = 0,\sigma^2 = 1)\)`

<img src="data:image/png;base64,#lec-4_files/figure-html/norm-examp-1.png" style="display: block; margin: auto;" />

---

## Normal distribution

- The first parameter of the normal distribution, `\(\mu\)`, represents the center of the distribution. Notice that this is the value that has the highest density.
--

- The second parameter `\(\sigma^2\)` (or `\(\sigma\)`) controls the dispersion of the normal distribution.

--

- For example, two normal distributions with the same variance `\(\sigma^2\)` but different centers `\(\mu\)` can be drawn in R using:

.pull-left[
```r
# Plot margins (bottom, left, top, right), in inches
par(mai = c(1, 0.1, 0.1, 0.1))
# Red curve: Normal distribution with mean 0 and sd 1
curve(dnorm(x, mean = 0, sd = 1), from = -4, to = 6,
      axes = FALSE, ann = FALSE, col = "red")
# Blue curve: same sd, mean shifted to 2, added to the same plot
curve(dnorm(x, mean = 2, sd = 1), col = "blue", add = TRUE)
box(bty = "l")
axis(1, cex.axis = 1.3)
mtext(text = "Support", side = 1, line = 2, cex = 1.6)
```
]

.pull-right[
<img src="data:image/png;base64,#lec-4_files/figure-html/norm-ex-1-out-1.png" style="display: block; margin: auto;" />
]

---

## Normal distribution

- An example of two normal distributions with the same value of `\(\mu\)` but different `\(\sigma^2\)` would be:

.pull-left[
```r
# Plot margins (bottom, left, top, right), in inches
par(mai = c(1, 0.1, 0.1, 0.1))
# Red curve: Normal distribution with mean 0 and sd 1
curve(dnorm(x, mean = 0, sd = 1), from = -9, to = 9,
      axes = FALSE, ann = FALSE, col = "red")
# Blue curve: same mean, sd increased to 3, added to the same plot
curve(dnorm(x, mean = 0, sd = 3), col = "blue", add = TRUE)
box(bty = "l")
axis(1, cex.axis = 1.3)
mtext(text = "Support", side = 1, line = 2, cex = 1.6)
```
]

.pull-right[
<img src="data:image/png;base64,#lec-4_files/figure-html/norm-ex-2-out-1.png" style="display: block; margin: auto;" />
]

--

- As we can see in the plot, as the standard deviation `\(\sigma\)` increases from 1 to 3, the normal distribution becomes wider and shorter.

--

- In other words, `\(\sigma^2\)` (or `\(\sigma\)`) controls the variability of the distribution.

---

## Note

- Once we have assigned a value to our parameters `\(\mu\)` and `\(\sigma^2\)` (R uses `\(\sigma\)`), we have completely defined a Normal distribution.

--

- With this, we can compute the density (i.e. the height of the curve) assigned to each value of the random variable.

--

- Another important thing to note is that both parameters are independent. They control different aspects of the function. For example, we can have two Normal distributions that have the same variance but are centered at different points.
--

In teams of 3 students, draw two Normal distributions that have the same `\(\sigma^2\)` but different `\(\mu\)`'s. Then draw two Normal distributions with the same `\(\mu\)` but different `\(\sigma^2\)`'s.

---

## Statistical models

- Now, we have a notation for our observations `\(y_{ij}\)`.

--

- And we found a statistical model in the Normal distribution.

--

- We can start formalizing our hypotheses about the outcomes of an experiment.

--

- Let's go back to the smoking example...

---

## Smoking: Null model

- Reminder: We want to know if people who smoke have a different lung capacity than people who do not smoke.

--

- We tested the lung capacity of 8 participants, 4 non-smokers and 4 smokers.

--

- We will denote each observation of the non-smokers' group as `\(y_{11},y_{21},y_{31},y_{41}\)` and each observation of our smokers' group as `\(y_{12},y_{22},y_{32},y_{42}\)`.

--

- For short, we can say that `\(i = 1,\dots,4\)` denotes the observation number in group `\(j = 1,2\)`, where `\(j = 1\)` corresponds to the non-smokers and `\(j = 2\)` to the smokers.

--

- Both statements give us the same information; however, the second one is shorter.

--

- Imagine if we had 50 observations in each group: listing all of them would take us a page...

---

## Smoking: Null model

- Now we can think of two hypotheses:

--

- **First:** There are **no differences** between non-smokers and smokers.

--

- In other words, even though our observations have some variability, lung capacity is not a function of smoking status.

--

- **Second:** Lung capacity is a function of smoking status (this is the one that we might be more interested in).

--

- In other words, the **groups are different**.

--

- The first model is known as the **NULL Model**: A model that states that there are **no differences** in lung capacity between groups!
--

- We denote this model in the following way:

`$$y_{ij} \sim \text{Normal}(\mu,\sigma^2)$$`

---

## Smoking: Null model

- The **Null Model** formalizes the assumption that, regardless of the observation number `\(i\)` and the group `\(j\)` (smoking status), all observations are described by the same parameters.

--

- In other words, even though there might be some variability in our measures of lung capacity, observations collected for non-smokers and smokers come from the same distribution!

--

- That is, a single Normal distribution centered at some value `\(\mu\)` with some variability `\(\sigma^2\)`.

--

- Notice that we don't specify the values of `\(\mu\)` and `\(\sigma^2\)` that define our statistical model. However, we can still graph what the Null Model expects the data to look like.

---

## Graphical representation of the Null Model

.pull-left[
```r
# Plot margins (bottom, left, top, right), in inches
par(mai = c(1, 0.1, 0.1, 0.1))
# Both groups use the same mean and sd, so the two curves overlap
curve(dnorm(x, mean = 0, sd = 1), from = -4, to = 6,
      axes = FALSE, ann = FALSE, col = "red", lwd = 3)
curve(dnorm(x, mean = 0, sd = 1), col = "blue", add = TRUE,
      lty = 3, lwd = 3)
box(bty = "l")
mtext(text = "Lung capacity", side = 1, line = 2, cex = 1.6)
legend("topleft", bty = "n", col = c("blue", "red"),
       legend = c("non-smokers", "smokers"), lty = c(3, 1))
```
]

.pull-right[
<img src="data:image/png;base64,#lec-4_files/figure-html/normal-null-out-1.png" style="display: block; margin: auto;" />
]

--

- Notice that we have not added any numbers on the x axis. This is because we don't know the values of the parameters. However, given the specification of the model, we know that it expects all of our observations to come from the same distribution!

---

## Statistical Inference

- Once we have defined our model, our new objective will be to find some suitable values for the parameters `\(\mu\)` and `\(\sigma^2\)` to better define our statistical model.

--

- This is known as Statistical Inference, and it will be the main objective of this class.
--

- In general we can say that Statistical Inference refers to the process by which we "infer" or learn the values of the parameters of a statistical model based on our observations of the world (Data).

---

# Data files for homework 2

- Link for data Section 1 homework 2:

```r
link_1 <- "https://raw.githubusercontent.com/ManuelVU/psych-10c-data/main/hw-2-problem-1.csv"
```

- Link for data Section 2 homework 2:

```r
link_2 <- "https://raw.githubusercontent.com/ManuelVU/psych-10c-data/main/hw-2-problem-2.csv"
```
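---

## Sketch: simulating from the Null Model

- The Null Model from the smoking example, `\(y_{ij} \sim \text{Normal}(\mu,\sigma^2)\)`, can be sketched in R as follows. This is only an illustration: the values `mu = 75` and `sigma = 5` are hypothetical (the model itself leaves `\(\mu\)` and `\(\sigma^2\)` unspecified), and the simulated numbers are not the class data.

```r
# Hypothetical parameter values, chosen only for illustration
mu    <- 75   # assumed center of the distribution
sigma <- 5    # assumed standard deviation (R uses sigma, not sigma^2)

set.seed(4)   # make the simulated values reproducible

# Rows index the observation i = 1,...,4; columns index the group j = 1, 2.
# Every value is drawn from the same Normal(mu, sigma^2) distribution.
y <- matrix(rnorm(n = 8, mean = mu, sd = sigma), nrow = 4, ncol = 2,
            dimnames = list(paste0("i=", 1:4), c("non-smokers", "smokers")))

y          # all simulated observations
y[2, 1]    # y_21: second observation of the non-smokers' group
y[3, 2]    # y_32: third observation of the smokers' group

# Under the Null Model, a single mean and variance summarize both groups
mean(y)
var(as.vector(y))
```

- Both columns are drawn from one distribution, so any difference between the two group means in this simulation is due to sampling variability alone.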